Simultaneous model selection via rate-distortion theory, with applications to cluster and significance analysis of gene expression data
Author
Abstract
High-dimensional data are prevalent across many application areas, and generate an ever-increasing demand for statistical methods of dimension reduction, such as cluster and significance analysis. One application area that has recently received much interest is the analysis of microarray gene expression data. The results of cluster analysis are open to subjective interpretation. To facilitate objective inference from such analyses, we use flexible parameterizations of the cluster means, paired with subset model selection, to generate sparse and easy-to-interpret representations of each cluster. Model selection in clustering is combinatorial in the numbers of clusters and experimental conditions, and thus presents a computationally challenging task. In this paper we introduce a simultaneous approach to subset model selection, applicable to model selection in both cluster and significance analysis. Our approach draws on results from rate-distortion theory, and allows us to turn the combinatorial model selection problem into a fast and simple line search. We show that simultaneous cluster model selection generates objectively interpretable models, and that the selection performance is competitive with a combinatorial search, at a fraction of the computational cost. Moreover, we show that the rate-distortion based significance analysis substantially increases the power compared with standard methods.
∗Contact: [email protected], fax +1 732 445-3428, telephone +1 732 445-3145
Similar resources
Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine
DNA microarray gene expression gives access to a wealth of information spanning thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and of differences between tumors. In this study we use statistical methods to reduce the high-dimensional data, select high-impact genes as biomarkers, and then classify ovarian tumors based on gene expression data of...
O-35: Over-Expression of XRCC1 As Potential Biomarker for Poor Prognosis in Human Preimplantation Embryos: Selection by Study of 84 Genes Involved in DNA Damage Signaling Pathways
Background: Chromosome abnormalities are associated with poor morphology and development in human preimplantation embryos, which together lead to poor outcomes. This study aimed to explore altered expression of DNA damage pathways in “poor morphological and development embryos with severe aneuploidies”. Materials and Methods: Surplus day-4 embryos of PGD cases were pooled in two groups: Poor progn...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in the prevention, diagnosis and treatment of diseases at the genomic level. In this paper, fast global k-means (fast GKM) is developed for clustering gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Finding the number of clusters in a data set: An information theoretic approach
One of the most difficult problems in cluster analysis is the identification of the number of groups in a data set. Most previously suggested approaches to this problem are either somewhat ad hoc or require parametric assumptions and complicated calculations. In this paper we develop a simple yet powerful non-parametric method for choosing the number of clusters based on distortion, a quantity ...
Adenine molecule interacting with golden nanocluster: A dispersion corrected DFT study
The interaction between nanoparticles and biomolecules such as protein and DNA is one of the major directions of nanobiotechnology research. In this study, we have explored the interaction of the adenine nucleic base with a representative golden cluster (Au13) by using dispersion corrected density functional theory (DFT-D3) within the GGA-PBE model of theory. Various active sites ...